Abstract

Many real networks that are inferred or collected from data are incomplete due 
to missing edges. Missing edges can be inherent to the dataset (Facebook friend 
links will never be complete) or the result of sampling (one may only have access
to a portion of the data). The consequence is that downstream analyses that
consume the network will often yield less accurate results than if the edges were 
complete. Community detection algorithms, in particular, often suffer when critical 
intra-community edges are missing. We propose a novel consensus clustering algorithm
to enhance community detection on incomplete networks. Our framework
utilizes existing community detection algorithms that process networks imputed 
by our link prediction based algorithm. The framework then merges their multiple 
outputs into a final consensus output. On average our method boosts performance 
of existing algorithms by 7% on artificial data and 17% on ego networks collected 
from Facebook. 


1 Introduction 

Many types of complex networks exhibit community structures: groups of highly
connected nodes. Communities or clusters often reflect nodes that share similar
characteristics or functions. For instance, communities in social networks can reveal
users’ shared political ideology. In the case of protein interaction networks, communities
can represent groups of proteins that have similar functionality. Since networks
that exhibit community structure are common in many disciplines, the last decade has
seen a profusion of methods for automatically inferring community structure.

Community detection algorithms rely on the topology of the input network to identify
meaningful groups of nodes. Unfortunately, real networks are often incomplete
and suffer from missing edges. For example, social network users seldom link to their
complete set of friends; authors of academic papers are limited in the number of
papers they can cite and can only cite already-published papers. Missing edges
can also be a result of the data collection process. For instance, Twitter often limits its
data feed to only a 10% “gardenhose” sample: constructing the mention graph from
this data would yield a graph with many missing edges [32]. Datasets crawled from
social networks with privacy constraints can also lead to missing edges. In the case of
protein-protein interaction networks, missing edges result from the noisy experimental
process used to measure pairwise interactions of proteins. Community detection
algorithms rarely consider missing edges, so even a “perfect” detection algorithm
may yield wrong results when it infers communities based on incomplete network
information.

Figure 1: Diagram describing the workflow of EdgeBoost

One straightforward approach for improving community detection in incomplete
networks is to first “repair” the network with link prediction, and then apply a community
detection method to the repaired network. The link prediction task is to
infer “missing” edges that belong to the underlying true graph. A link prediction algorithm
examines the incomplete version of the graph and predicts the missing edges.
Although link prediction is a well-studied area, little attention has been given
to how it can be used to enhance community detection. Imputing missing edges using
link prediction can result in the addition of both intra- and inter-community links. If
one were to simply run a link predictor and cluster the resulting network, the output
can only be improved if the link predictor accurately predicts links that reinforce the
true community structure.

We propose the EdgeBoost method, which repeatedly applies link prediction via 
a random process, thereby mitigating the inaccuracies in any single link-predictor run. 
Our method first uses link prediction algorithms to construct a probability distribution 
over candidate inferred edges, then creates a set of imputed networks by sampling 
from the constructed distribution. It then applies a community detection algorithm to 
each imputed network, thereby constructing a set of community partitions. Finally, our
technique aggregates the partitions to create a final high-quality community set.

An important and desirable quality of our method is that it is a meta-algorithm that
does not dictate the choice of specific link prediction or community detection algorithms.
Moreover, the user does not have to manually specify any parameters for the
algorithm. We propose an easy-to-implement, black-box mechanism that attempts to
improve the accuracy of any user-specified community detection algorithm. Our
contributions in this paper include:
2 Related Work


Community Detection Overview — There are many variants of the community 
detection problem: communities can be disjoint, overlapping, or hierarchical. The problem
of detecting disjoint communities of nodes is the most popular and what we focus
on in this work. While the other variants, especially overlapping community detection,
are of growing interest, detecting strict partitions is still a hard and relevant problem. In
fact, recent work has shown that disjoint algorithms can perform better than overlapping
algorithms on networks with overlapping ground truth. We chose a collection
of six algorithms to test our system on: Louvain, InfoMap, Walk-Trap,
Label-Propagation, Significance, and Surprise. We chose Louvain and
InfoMap for many of our experiments because both perform well in recent comparisons;
InfoMap is typically superior in quality while Louvain is more scalable.

Ensemble Community Detection — Though single-technique community detection
is by far the most common, a number of recent projects have proposed ensemble
techniques. Aldecoa et al. describe an ensemble of partitions generated by
different community detection algorithms, which differs from our approach of using
the same algorithm and creating the ensemble by creating different networks. Two prior
works, including [20], present techniques for consolidating partitions generated by
repeatedly running the same stochastic community detection algorithm. We implemented both of
their methods but neither was suitable for consolidating clusters in our system; this is
most likely because the partitions generated from our system have more variation than
partitions generated from multiple runs of a stochastic algorithm. At a high level, our
proposed technique is a type of ensemble. Most ensemble solutions take the network
as-is and assume that a “vote” between algorithms will produce more correct clusters.
While this may work in some situations, bad input will often reduce the performance
of all constituent algorithms (possibly in a systematic way) and therefore of the overall
ensemble. Our proposed method is novel in its iterative application of link prediction
to increase the efficacy of community detection algorithms.

Ensemble Clustering — Ensemble data clustering, first proposed
by Strehl et al., involves the consolidation of multiple partitions of the data
into a final, hopefully higher quality partitioning. While many of the ensemble clustering
methods share a similar workflow with our method, the fact that these techniques
were developed for data clustering and not community detection makes them distinct
from our work. For instance, Dudoit et al. use bootstrap samples of the data to generate
an ensemble of partitions, which in the case of network community detection would
be difficult since networks have an interdependency between nodes, and nodes cannot
be sampled with replacement like data in Euclidean spaces. Monti et al. propose
a consensus clustering technique with the goal of determining the most stable partition
over various parameter settings of the input algorithm. Similar to our work, many
ensemble clustering algorithms use a consensus matrix as a data structure
to aggregate the ensemble of partitions. Unlike previous methods such as [33], which
use agglomerative clustering to compute the final partition, we propose an aggregation
algorithm that uses connected components, which is not possible on data clustering
problems.

Community Significance — In the community detection literature, techniques 
have been proposed both for evaluating the significance/robustness of communities
and for detecting significant communities. Prior work proposes a network perturbation
algorithm for evaluating the robustness of a given network partition. Methods have
also been developed that measure the statistical significance of individual
communities. Our goal, however, is not to generate confidence metrics on communities
but rather to generate more accurate communities overall. Previous work has
also proposed techniques for finding significant communities using sampling based
techniques. Rosvall et al. and Mirshahvalad et al. propose algorithms for
detecting significant communities by clustering bootstrap sample networks and identifying
communities that occur consistently amongst the sample networks. The method
proposed by Gfeller et al. attempts to identify significant communities by finding unstable
nodes using a method based on sampling edge weights. Their methods differ
from ours in that they create samples from the existing network topology. Most similar
to our work is [30], which attempts to solve the problem of identifying
communities in sparse networks by adding edges that complete triangles. Their method
is simply to add a fixed percentage of triangle-completing edges and cluster the resulting
network; in contrast, our approach involves the repeated application of any
link prediction algorithm.

Community Granularity — The problem of detecting communities at various levels
of granularity is a well-studied and related problem. Prior work has analyzed
the “resolution limit” of detecting communities at all granularities. As a result of these
limitations, many methods have been proposed for community detection
at different granularities. New objective functions that improve the resolution limit,
as well as tunable objectives that allow community detection at various resolutions,
have been proposed. Delvenne et al. propose a method for identifying the stability
of communities by using the Markov time of a random walk on the network. Granularity
is a related problem in that missing edges can lead to communities detected at
wrong granularities. However, these methods do not address the problem of detecting
communities on incomplete networks.

3 Problem Formulation 

3.1 Communities in Incomplete Networks 

To motivate the need for algorithms that are robust to missing edges, we experimented
on existing community detection algorithms. To test these algorithms’ sensitivity
to missing edges on a range of networks, we utilize the LFR benchmark [22]. LFR
creates random networks with ground-truth community structure (planted partition),
parametrized by: number of nodes, mixing parameter μ, and exponents of the degree and
community size distributions (see [22] for a full description). The mixing parameter is
a ratio that ranges from only intra-community edges (0) to only inter-community edges
(1). Previous studies have compared the quality of community detection algorithms
using the benchmarks and used μ as the variable parameter, roughly capturing
how difficult a network is to cluster. As we are concerned with characterizing the effect
of missing edges, we modify the LFR benchmark by randomly deleting edges from the
networks it generates. We denote the parameter δ as the percentage of removed edges.
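The edge-deletion step can be sketched as follows. This is a minimal illustration, assuming the network is given as a list of edge tuples and that edges are removed uniformly at random; the function name `remove_edges` and the `seed` parameter are hypothetical and not part of the LFR benchmark itself.

```python
import random

def remove_edges(edges, delta, seed=None):
    """Return the edge list with a fraction `delta` of edges removed uniformly at random.

    `edges` is a list of (u, v) tuples; `delta` is a fraction in [0, 1].
    """
    rng = random.Random(seed)
    edges = list(edges)
    n_remove = round(delta * len(edges))
    # Pick the indices to drop, then keep everything else in original order.
    dropped = set(rng.sample(range(len(edges)), n_remove))
    return [e for i, e in enumerate(edges) if i not in dropped]
```

For example, calling `remove_edges(edges, 0.2)` on a network with 100 edges keeps 80 of them.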



Figure 2: NMI of six community detection algorithms with varying percentages of 
removed edges. 


The goal of our analysis is to characterize the effect of both μ and δ on two metrics:
Normalized Mutual Information (NMI) and the Relative Error (RE) of the size of the
inferred partitions. NMI is a standard information theoretic measure for comparing the
planted partition provided by the benchmark to the inferred partition produced by the
algorithm. We define RE as the relative error of the number of communities C inferred by
the algorithm compared to the number of communities C* in the planted partition:


$$\mathrm{RE} := \frac{C - C^{*}}{C^{*}} \tag{1}$$


Since NMI can decrease for a variety of reasons (shifted nodes, shattered or merged
communities), we include RE as a means to determine the more specific effects that
missing edges can have on community detection. Each point in the plot is generated by
averaging the corresponding statistic over 50 random networks generated by our modified
LFR benchmark. We set static values for the following benchmark parameters:
number of nodes (1000), average degree (10), maximum degree (50), exponent
of the degree distribution (-2), exponent of the community size distribution (-1),
minimum community size (10), and maximum community size (50). We also varied parameters
such as “number of nodes” and “average degree”, finding qualitatively similar
results for the effect of δ on NMI and RE. Similar to previous research, we select
an “average degree” that results in the sparse networks that motivate the need for the
methods presented in this paper.
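The two metrics can be computed along the following lines. This is a sketch under stated assumptions: partitions are given as per-node label lists, and NMI is normalized by the arithmetic mean of the two entropies, which is a common choice in the community detection literature but is not specified in the text. The function names are illustrative.

```python
import math
from collections import Counter

def relative_error(num_inferred, num_planted):
    """Signed relative error of the number of inferred communities."""
    return (num_inferred - num_planted) / num_planted

def nmi(labels_a, labels_b):
    """Normalized mutual information, 2*I(A;B) / (H(A) + H(B)) (assumed normalization)."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    h_a = -sum(c / n * math.log(c / n) for c in count_a.values())
    h_b = -sum(c / n * math.log(c / n) for c in count_b.values())
    mi = sum(
        c / n * math.log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    if h_a + h_b == 0:
        return 1.0  # both partitions are a single community
    return 2 * mi / (h_a + h_b)
```

A positive RE indicates that the algorithm shattered the planted communities into more pieces; a negative RE indicates merging.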

Figure 2 shows how NMI varies with respect to δ and μ for six popular community
detection algorithms. All of the algorithms behave in a qualitatively similar manner:
as δ increases, the NMI score decreases. Similar to previous studies, InfoMap scores
best with respect to μ and, not surprisingly, is also the most robust to missing edges.
Figure 3: RE of six community detection algorithms with varying percentages of
removed edges.

More interesting are the results in Figure 3, which show how the number of inferred
communities differs from the number of communities in the planted partition.
Four of the six algorithms show a trend of detecting too many communities both as
a function of μ and δ, while only the Louvain and Label-Propagation algorithms
detect fewer communities on average. Modularity is known to suffer from a resolution
limit, meaning that the measure tends to favor larger communities. Since Louvain
uses modularity as its objective function, it is not surprising that Louvain, on average,
infers communities that are larger than in the planted partition. Overall, it is more often
the case that missing edges will cause community detection algorithms to “shatter”
ground truth communities, sometimes producing 2-3 times more communities. Both in
terms of NMI and RE, all six algorithms show a significant deterioration in community
detection quality, once again underscoring the need for algorithms that are robust to
missing edges.


3.2 Link Prediction for Enhancing Community Detection 

The ideal scenario for community detection is one where a network consists of only
intra-community edges and where the detection of communities reduces to the problem
of identifying weakly connected components. The reality is that we rarely find
such clean graphs, as edges can be “missing” for reasons ranging from sampling to
semantics. This last factor is important, as a missing edge between nodes in the same
community is not necessarily incorrect: the semantics of a network do not necessitate
an explicit relationship between users in the same community. In the case of an
ego-network on Facebook, for example, not all friends in the same community actually
know each other, as they may be grouped because they attend the same college as the
ego-user. Similarly, a biological network may have a set of proteins working in concert
as part of a functional “community” but many do not form a clique as the edges represent
(up or down)-regulation. In both scenarios, the edges that are missing are implicit
edges representing the intra-community links (e.g., an edge representing the relationship
in-the-same-community-as). It is these intra-community edges, whether they are
implicit or explicit, that can have severe impact on the detection of communities.

The hypothesis of this paper is that by recovering edges in incomplete networks,
community detection quality can be improved. If link prediction is to be an effective
strategy at recovering lost community structure, it must be accurate at predicting
intra-community edges that reinforce communities. If the link prediction algorithm has
too high a false-positive rate, thereby predicting too many inter-community links, it is
likely to degrade community detection performance. Using the modified LFR benchmark,
we analyzed the intra-community precision of various link prediction algorithms
over a range of μ and δ values. We do not intend to exhaustively test all of the link
prediction algorithms proposed in the literature, but we select three computationally
efficient techniques that are among the best: Adamic-Adar (AA), Common-Neighbors
(CN), and Jaccard.

Each of these algorithms can produce a score for missing edges that complete triangles
in the input network, allowing us to create a partial ordering over the set of missing
edges. Figure 4 shows the results from our experiment. For each plot, the y-axis represents
the intra-edge precision and the x-axis represents the number of top-k edges
as a percentage of the total number of edges in the original network (before random
deletion). For example, if the original network had 2000 edges, then an edge percent
value of 20% would correspond to selecting the top-400 edges from the network as
scored by the given link prediction method. By varying k we are able to observe the
classification accuracy implied by the ranking produced by each link-predictor.
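To make the scoring and the precision measurement concrete, here is a minimal sketch using the standard definitions of the three predictors (common-neighbor count, Jaccard coefficient of the neighborhoods, and Adamic-Adar as the sum of 1/log(degree) over common neighbors). The function names, the edge-list input format, and the assumption that node identifiers are sortable are choices made for this illustration only.

```python
import math
from itertools import combinations

def score_missing_edges(edges, method="jaccard"):
    """Score every non-adjacent node pair that shares at least one neighbor."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    scores = {}
    for nbrs in adj.values():
        # Any two neighbors of the same node would close a triangle if linked.
        for u, v in combinations(sorted(nbrs), 2):
            if v in adj[u] or (u, v) in scores:
                continue
            common = adj[u] & adj[v]
            if method == "cn":
                scores[(u, v)] = len(common)
            elif method == "jaccard":
                scores[(u, v)] = len(common) / len(adj[u] | adj[v])
            elif method == "aa":
                # A common neighbor has degree >= 2, so log(degree) > 0.
                scores[(u, v)] = sum(1 / math.log(len(adj[z])) for z in common)
            else:
                raise ValueError(f"unknown method: {method}")
    return scores

def intra_precision_at_k(scores, community, k):
    """Fraction of the k highest-scoring pairs whose endpoints share a community."""
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    if not top:
        return 0.0
    return sum(community[u] == community[v] for u, v in top) / len(top)
```

With ground-truth labels in a dictionary `community`, sweeping `k` from 1 up to the number of scored pairs traces out one precision curve of the kind described above.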

In Figure 4 we first notice that, as with community detection, link prediction performance
decreases as a function of both δ and μ. For low μ, all link prediction
algorithms are capable of achieving high intra-edge precision even for δ values of 60%,
but the quality of link prediction drops significantly for high levels of μ. For μ above
0.5, any link-predictor that uses the number of common neighbors as a signal will do
poorly, since the majority of a node’s neighbors belong to different communities. The
Jaccard algorithm maintains the highest level of precision as a function of the number
of edges. While the AA algorithm usually outperforms Jaccard, here AA is only better for low
values of k.

The results in Figure 4 show that link prediction can be effective at imputing
intra-community edges, especially for sparse networks that have lower μ values. The results
also show that for networks with high μ and δ values, the top-scoring edges as predicted
by all three link prediction algorithms contain a large percentage of inter-community
links. While this demonstrates the feasibility of using link prediction to recover missing
intra-community edges, we do not know how to set the parameters (e.g., the k value to
use for partitioning the ranked edges) for real-world networks. We will return to this,
but first we formalize the problem.

Let $G = (V, E)$ be the input network, and let $E_{missing} = (V \times V) \setminus E$ denote
the set of missing edges in $G$. We formally define a link-predictor $L$ as a function
that takes any pair of nodes $(x, y) \in E_{missing}$ and maps it to a real number:

$$L : E_{missing} \rightarrow \mathbb{R} \tag{2}$$

A community detection algorithm can be formally described as a function $C$ that takes
as input any network $G$ and produces a disjoint partition of the nodes $\{C_1, C_2, \ldots, C_k\}$.

Figure 4: Precision plots of three link prediction algorithms: Adamic-Adar (left), Common
Neighbors (middle), and Jaccard (right) for various values of the mixing parameter
μ: 0.1 (top), 0.3 (middle), and 0.5 (bottom). The x-axis corresponds to the number of top-k
edges as scored by the link prediction algorithm as a percentage of the number of
edges in the network. Intra-edge precision is on the y-axis.

The most naive algorithm for enhancing community detection consists of a few
simple steps. First, score missing edges in G using L. Next, select the top-k missing
edges according to the link-predictor and add these edges to G. Lastly, apply the algorithm
C to the imputed network. However, simply adding links with high scores for
networks with large μ and δ values may be problematic, since many of these links can
be inter-community, thereby having a negative effect on community detection.
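The three naive steps amount to only a few lines. In this sketch, `scores` is assumed to be a dictionary mapping each candidate missing edge to its link-predictor score, and `detect_fn` is a placeholder for any community detection routine that accepts an edge list; both names are hypothetical.

```python
def naive_enhance(edges, scores, detect_fn, k):
    """Add the k highest-scoring missing edges, then cluster the imputed network.

    `scores` maps candidate edges to link-predictor scores (assumed precomputed);
    `detect_fn` stands in for a community detection algorithm of the user's choice.
    """
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    imputed = list(edges) + top_k
    return detect_fn(imputed)
```

The single hard cut at `k` is the weakness: every edge ranked below the cut is ignored, and every edge above it is trusted equally.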

An intuition for why this naive algorithm does not work is illustrated in the top
histogram of Figure 5. The plot shows the score distribution of both intra- and
inter-community edges predicted by the AA link predictor on a randomly generated benchmark
network. The distribution of the intra-community edges substantially overlaps
with the inter-community distribution, so no choice of threshold for
adding links is helpful for community detection. In addition, as this plot shows, the
top-k edges comprise only a small percentage of the total set of intra-community
edges. By simply selecting from the top-k scoring edges, many of the intra-community
edges that are lower ranked will never be selected. As demonstrated in Figure 4, the
choice of k can have a significant impact on the quality of the edges; selecting
the right k therefore becomes a challenge when the complexity and sparsity of the network
are unknown.

Figure 5: Histogram of edge weights on a benchmark graph with μ = 0.4 and 20% of
the edges removed: scores from AA link predictor (top) and weights of co-community
network (bottom).


4 Methods 

Our core observation is that link prediction in both high-δ and high-μ settings is
brittle: it can carry information, but any single prediction is likely to be wrong. Therefore,
we propose an improved method for applying link prediction to enhance community
detection.


Procedure 1 EdgeBoost Algorithm

Input: A network G = (V, E), link-predictor L, community detection algorithm C,
       number of iterations n
Output: A partition P* of the vertices in G

     1:  E_missing ← L(G)                        ▷ score edges in G
     2:  D = Imputation(E_missing)               ▷ create edge distribution
     3:  P = []                                  ▷ initialize list of partitions
     4:  for i ← 1, n do
     5:      k ∼ U(1, |E|)
     6:      {e_1, e_2, ..., e_k} ∼ D            ▷ sample k edges
     7:      G_i = (V, E ∪ {e_1, ..., e_k})      ▷ impute G_i
     8:      P_i = C(G_i)                        ▷ cluster G_i
     9:      P = P ∪ P_i
    10:  end for
    11:  P* = AggregationFunction(G, P)
    12:  return P*


In order to mitigate the potential side effects of imperfect link prediction, we propose
a sampling based algorithm that repeatedly applies link prediction to the input
network. The EdgeBoost pseudocode is shown in Procedure 1 and proceeds in four
steps. First, it uses a link prediction function to score missing edges and thereby construct
a probability distribution over the set of missing edges (lines 1-2). The algorithm
then repeatedly samples a set of edges from this probability distribution, adds these sampled
edges to the original network (lines 5-7), and runs community detection on the enhanced
network (line 8). Each iteration produces a new set of communities, which is added to
the set of partitions (line 9). After the sample-detect-partition sequence is executed
many times, we aggregate the overall set of observed partitions (line 11) to produce a
final clustering.
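The control flow of the procedure can be sketched as a short driver function. Because EdgeBoost is a meta-algorithm, the four components are passed in as callables here; the parameter names (`score_fn`, `sample_fn`, `detect_fn`, `aggregate_fn`), the edge-list representation, and the clamping of k to the number of candidate edges are assumptions of this sketch rather than details given in the text.

```python
import random

def edgeboost(edges, score_fn, sample_fn, detect_fn, aggregate_fn,
              n_iter=50, seed=None):
    """Skeleton of the sample-impute-detect loop followed by aggregation.

    score_fn(edges)            -> dict mapping missing edges to scores
    sample_fn(scores, k, rng)  -> list of k edges drawn from the score distribution
    detect_fn(edges)           -> a partition of the nodes
    aggregate_fn(edges, parts) -> the final consensus partition
    """
    rng = random.Random(seed)
    edges = list(edges)
    scores = score_fn(edges)            # score the missing edges once
    partitions = []
    for _ in range(n_iter):
        # k is uniform on [1, |E|]; clamped so we never request more
        # edges than there are candidates (an assumption of this sketch).
        k = min(rng.randint(1, len(edges)), len(scores))
        sampled = sample_fn(scores, k, rng)
        imputed = edges + list(sampled)  # impute the network
        partitions.append(detect_fn(imputed))
    return aggregate_fn(edges, partitions)
```

Passing the components as arguments mirrors the black-box design: any link predictor or community detection routine with a compatible interface can be swapped in without touching the loop.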

4.1 Network Imputation 

The network imputation component of EdgeBoost uses the input network and a 
link prediction algorithm to produce a probability distribution over the set of missing 
edges. The number of edges k sampled during each iteration of the imputation procedure
(lines 5-6) is drawn uniformly at random between 1 and |E|, the number of edges in the
input network. We experimented with many fixed values of k and found that drawing k at
random works as well as fixing it.

We propose an imputation algorithm that constructs a distribution in which the
probability of drawing an edge corresponds to its score produced by the link predictor.
Missing edges that are scored higher by the link-predictor will have more probability
mass than lower scoring edges. The probability function constructed from this process
is:

$$P(e) = \frac{L(e)}{\sum_{e' \in E_{missing}} L(e')} \tag{3}$$


Our imputation algorithm is more likely to pick higher scoring edges, which can
result in a fairly accurate selection of intra-community edges as shown in Section 3.2.
At the same time, even low scoring edges have probability mass, which is important
since for some networks, intra-community edges can also be low scoring.
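A minimal sketch of score-proportional sampling follows. It assumes that an edge's probability is its score divided by the sum of all scores, that scores are strictly positive, and that the k edges in one draw are distinct (sampling without replacement); the text does not state the last two points, so treat them as assumptions. Function names are illustrative.

```python
import random

def edge_distribution(scores):
    """Normalize link-predictor scores into a probability distribution."""
    total = sum(scores.values())
    return {edge: s / total for edge, s in scores.items()}

def sample_edges(scores, k, rng=None):
    """Draw up to k distinct edges, each draw proportional to the remaining scores."""
    rng = rng or random.Random()
    pool = dict(scores)
    chosen = []
    for _ in range(min(k, len(pool))):
        candidates = list(pool)
        weights = [pool[e] for e in candidates]
        edge = rng.choices(candidates, weights=weights, k=1)[0]
        chosen.append(edge)
        del pool[edge]  # without replacement: an edge is added at most once
    return chosen
```

Because every edge keeps non-zero weight, a low-scoring intra-community edge still appears in some of the imputed networks, which a top-k cut would never allow.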

4.2 Partition Aggregation 

Having generated many possible “images” of our original graph via network imputation,
we can apply community detection algorithms to each. Each execution of
the algorithm produces a partition (possibly unique) based on the input graph. After
generating many such partitions, we use partition aggregation to produce a final
output. Previous ensemble clustering techniques construct an n × n consensus
matrix that represents the co-occurrence of nodes within the same community. The
goal of such a data structure is to summarize the information produced by the various
partitions. We propose a similar data structure, a co-community network G_cc, which
consists of nodes from the input network and edges with weights that correspond to
the normalized frequency with which the two nodes appear in the same
community.
Figure 6: Visualization of the co-community network for the “Zachary’s karate club”
network. Each panel shows the network pruned at various thresholds τ.

The G_cc graph is a transformation of the input network into one that represents the
pairwise community relationships between nodes, rather than the functional relationships
defined by the semantics of the input network G. G_cc links nodes that appear
in the same community, and weights them based on frequency of co-occurrence (i.e.,
the edge between two nodes has a normalized weight equal to the number of times the
two nodes appear together in the same community divided by the number of partitions).
Thus, G_cc exhibits community structure representing communities that appeared
frequently in the input partitions. As shown in the lower plot of Figure 5, there is a clear
distinction between the intra-community and inter-community edge-weight distributions in G_cc.
A simple mechanism for identifying a final “partitioning” is to remove all edges for
which we have low confidence (i.e., inter-community edges) and study the resulting
connected components (CC). We parametrize the pruning with a threshold τ and
prune edges below that value. The semantics of the resulting graph is that all pairs of
linked nodes have been seen in the same community at least a fraction τ of the time, and
consequently all nodes captured in a CC are connected through a chain of such pairs.
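The construction and pruning of the co-community network can be sketched as below. Each partition is assumed to be a list of node sets, weights are stored in a dictionary keyed by sorted node pairs, and the helper names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def co_community_weights(partitions):
    """Weight of (u, v) = fraction of partitions placing u and v together."""
    counts = Counter()
    for partition in partitions:
        for community in partition:
            for u, v in combinations(sorted(community), 2):
                counts[(u, v)] += 1
    n = len(partitions)
    return {pair: c / n for pair, c in counts.items()}

def prune_and_components(nodes, weights, tau):
    """Drop edges with weight below tau and return the connected components."""
    nodes = list(nodes)
    adj = {u: set() for u in nodes}
    for (u, v), w in weights.items():
        if w >= tau:
            adj[u].add(v)
            adj[v].add(u)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        # Depth-first traversal collects one component.
        stack, comp = [start], set()
        seen.add(start)
        while stack:
            u = stack.pop()
            comp.add(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        components.append(comp)
    return components
```

Raising `tau` can only split components, never merge them, so the partitions obtained at increasing thresholds are nested refinements of one another.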

Figure 6 shows an example of a co-community network pruned at various thresholds.
The network in this diagram is the famous Zachary’s karate club [43]; the colors
of the nodes denote the ground-truth community assignments of each node. The original
network is shown in the upper left quadrant, and the remaining quadrants show
the co-community network pruned at different thresholds. As we can see in the upper
right quadrant, if we threshold at a small value of τ we are likely to get nearly a fully
connected network (and certainly one with only one large component). This is due to
the fact that given enough iterations of the link prediction/community detection loop
we are likely to find at least a few cases where nodes that would ordinarily fall into two
communities are placed into the same one. At τ = 0.5 we see the CCs reflect the community
structure in the original network (the “correct partition”). As we increase the
threshold the two true communities are further shattered into sub-communities, leaving
some nodes completely isolated. One can interpret the connected components at these
higher levels of τ as capturing the core members of the true communities: members
who co-occur with each other a very high percentage of time and do not co-occur often
with nodes outside of their community.

While τ may be set manually (appropriate for applications where some level
of confidence is desirable), there are other applications where we would prefer that this
threshold be chosen automatically. As the last component of our framework we propose
a way of selecting τ and constructing a final partitioning given that chosen value.
Since the edge weights in G_cc correspond to the fraction of times two nodes appear
in the same community, they are rational numbers. We can therefore enumerate all the
possible values of τ, {1/n, 2/n, ..., n/n}, on the interval [0,1], where n is the number
of iterations. At each value of τ we prune
all edges with weights less than τ and compute the partition of G_cc that corresponds
to the connected components. We then score this partition according to Equation 5 and
select the threshold and corresponding partition that maximizes this score.

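In previous work, Monti et al. propose a formula for computing the “consensus”
score of an individual cluster. For a given community C_k of size N_k, their
score sums the co-community weights and divides the sum by the maximum possible weight,
where w_ij denotes the weight of the edge between nodes i and j in G_cc:

$$m_k = \frac{1}{N_k (N_k - 1)/2} \sum_{\substack{i, j \in C_k \\ i < j}} w_{ij} \tag{4}$$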

We score a partition p_τ parametrized by a threshold τ by taking the weighted sum
of the scores m_k for each community in the partition. We use a weighted sum because
the score contribution of each community should be commensurate with its size:

$$S(p_\tau) = \sum_{k \in p_\tau} N_k \, m_k \tag{5}$$

If the final partition has any singleton nodes that don’t belong to any community,
we connect each stray node to the community to which it has the highest mean edge
weight in the un-pruned co-community network.
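The automatic threshold search can be sketched as follows. Several details are assumptions of this illustration rather than statements from the text: the per-community score is the mean pairwise co-community weight, the partition score weights each community's score by its size, singleton communities score zero, and ties are broken in favor of the smaller threshold. The final reattachment of stray singletons is left out. All function names are hypothetical.

```python
from itertools import combinations

def components_at(nodes, weights, tau):
    """Connected components after pruning edges with weight < tau (union-find)."""
    parent = {u: u for u in nodes}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for (u, v), w in weights.items():
        if w >= tau:
            parent[find(u)] = find(v)
    groups = {}
    for u in nodes:
        groups.setdefault(find(u), set()).add(u)
    return list(groups.values())

def consensus_score(community, weights):
    """Mean co-community weight over all node pairs in the community."""
    n_k = len(community)
    if n_k < 2:
        return 0.0  # assumption: singletons contribute nothing
    total = sum(weights.get(pair, 0.0)
                for pair in combinations(sorted(community), 2))
    return total / (n_k * (n_k - 1) / 2)

def partition_score(partition, weights):
    """Size-weighted sum of per-community scores (assumed weighting)."""
    return sum(len(c) * consensus_score(c, weights) for c in partition)

def select_threshold(nodes, weights, n_iter):
    """Try tau = 1/n, 2/n, ..., 1 and keep the best-scoring partition."""
    nodes = list(nodes)
    best = None
    for i in range(1, n_iter + 1):
        tau = i / n_iter
        partition = components_at(nodes, weights, tau)
        score = partition_score(partition, weights)
        if best is None or score > best[0]:
            best = (score, tau, partition)
    return best[1], best[2]
```

Here `weights` is keyed by sorted node pairs and `n_iter` is the number of partitions that were aggregated, so every achievable edge weight coincides with one of the candidate thresholds.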

5 Experiments 

We have conducted a series of experiments to test EdgeBoost on the LFR benchmark
networks, standard real-world networks (e.g., karate club), and a set of ego networks
from Facebook. First, we present a comparison of EdgeBoost with different
community detection methods. Subsequent experiments include an analysis of various
parameter settings of EdgeBoost.

Figure 7: Performance of six popular community detection algorithms on the LFR
benchmark networks. Dashed yellow bar shows the improvement of EdgeBoost over
using the baseline community detection method.

5.1 Comparing EdgeBoost with Different Community Detection 
Methods 

As in the analysis in Section Communities in Incomplete Networks, we evaluate
our method against the LFR benchmark over various settings of the mixing ratio
μ and the percentage of missing edges, δ. In Figure 7 we show the performance
gain (striped yellow bars) of EdgeBoost for six different community detection
algorithms: InfoMap, Louvain, WalkTrap, Label-Propagation, Surprise, and Significance.
The number of imputation iterations is fixed at 50 for both algorithms and the bars
are generated by averaging over 50 randomly generated networks. Although not shown in
the figure, we tested all three link prediction algorithms and did not find a
substantial difference. In agreement with our link prediction analysis in Section Link
Prediction for Enhancing Community Detection, Jaccard slightly outperformed the other
methods, so we chose Jaccard as the link prediction algorithm for EdgeBoost.
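For reference, the Jaccard score of a candidate edge is the number of neighbors two
nodes share divided by the size of the union of their neighborhoods. The sketch below
only illustrates this scoring step; how EdgeBoost turns the scores into imputed edges
is defined in the section on link prediction and is not reproduced here. The function
name and the adjacency representation (a dictionary mapping each node to its set of
neighbors) are assumptions of the example.

```python
from itertools import combinations


def jaccard_scores(adj):
    """Score every non-adjacent node pair by the Jaccard coefficient of its neighborhoods."""
    scores = {}
    for u, v in combinations(sorted(adj), 2):
        if v in adj[u]:
            continue  # the edge already exists, so there is nothing to predict
        union = adj[u] | adj[v]
        if not union:
            continue  # both nodes are isolated
        score = len(adj[u] & adj[v]) / len(union)
        if score > 0:
            scores[(u, v)] = score
    return scores


# Hypothetical toy graph: a triangle 1-2-3 with a pendant node 4 attached to 3.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(jaccard_scores(adj))
```

Because the score is zero for nodes with no common neighbor, only pairs at distance
two can receive a positive score.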

We can see from Figure 7 that our method improves performance for almost all
input community detection algorithms. One exception is that EdgeBoost shows a
decrease in performance for the Label-Propagation algorithm at a μ value of 0.5. As
in other studies [20], the Label-Propagation algorithm’s performance becomes erratic
at μ values of 0.5 or greater, most likely because Label-Propagation assumes
that a node’s label should be chosen based on the labels of its neighbors. While
EdgeBoost is designed to work with stochastic algorithms and variations of the input
network, an algorithm with too much variation, as is the case with Label-Propagation,
can lead to decreased performance. While we can test the limits of community
detection by setting high values of μ, it has been shown that LFR networks with μ values
of 0.5 and higher do not reflect the expected properties of real world networks 1^ .

Figure 8: Performance of EdgeBoost (solid) and the baseline Louvain algorithm
(dashed) on the LFR benchmark. The purple shaded region shows the improvement
of EdgeBoost for NMI. The bottom row of plots shows the relative error of the
partition size.

Figure 8 shows the performance gain of EdgeBoost on the Louvain algorithm
in more detail. As our previous analysis showed, the baseline Louvain algorithm tends
to detect bigger communities on average than those in the planted partition. The bottom
row shows that for moderate values of δ, EdgeBoost is able to recover the smaller
communities in the planted partition. At very high values of δ (≥ 0.4), a network may be
so sparse that perfect recovery of the correct communities is most likely not possible.
Even for these high δ values, EdgeBoost still shows an improvement in NMI over
the baseline method.

While the LFR benchmark captures certain properties, it is an imperfect model of
real-world networks. To test EdgeBoost on real network data, we also performed
experiments on two real-world network data sets. The first data set consists of a suite
of standard networks for benchmarking community detection. The data set includes
Zachary’s Karate Club network (Karate) [43], a network of political books (Books) M,
a blog network (Blogs) Ul, and the American college football network (Football) ifT^ .
All of these networks have a ground truth partition, so we can use NMI to evaluate
the performance of community detection. Figure 9 shows the results of EdgeBoost
on each of the four networks with the same six input community detection
algorithms used in the experiment above. In all but 3 of the algorithm/network
configurations, EdgeBoost improves performance, by an average of 14%. On the Football
network, EdgeBoost does worse with the InfoMap, Label-Propagation, and WalkTrap
algorithms, but decreases performance by only 1.6% on average. Overall, this
dataset gives some assurance that EdgeBoost can improve performance on real
networks.
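For readers unfamiliar with NMI, the sketch below computes it for two non-overlapping
labelings of the same nodes. It is a hedged illustration rather than the evaluation
code used in these experiments: it assumes the common normalization 2 I(A;B) / (H(A) +
H(B)), which this section does not specify, and the function name and list-of-labels
input format are choices made for the example.

```python
from collections import Counter
from math import log


def nmi(labels_a, labels_b):
    """Normalized mutual information between two labelings of the same nodes."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    # Entropy of each labeling.
    h_a = -sum(c / n * log(c / n) for c in count_a.values())
    h_b = -sum(c / n * log(c / n) for c in count_b.values())
    # Mutual information from the joint label counts.
    mi = sum(
        c / n * log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    if h_a + h_b == 0:
        return 1.0  # both labelings put every node in a single cluster
    return 2 * mi / (h_a + h_b)
```

The score does not depend on how the clusters are named, so a detected partition can
be compared with the ground truth without first matching cluster labels.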

1 This data set is not cited in the literature but can be found at
http://www-personal.umich.edu/~mejn/netdata/

Figure 9: Comparison of EdgeBoost on a set of standard real-network benchmarks
for community detection.


We also test EdgeBoost on a data set of Facebook ego-networks ESI that capture
all neighbors (and their connections) centered on a particular user. The data set
described in the original paper by McAuley et al. consists of networks from three major
social networks: Facebook, Google+ and Twitter. The Facebook data set is likely
the highest quality of the three: it contains ground truth that was obtained from a
user survey in which the ego user of each network provided community labels. The
ground truth for the ego-networks from Twitter and Google+ is of lower quality since it
was obtained by crawling the publicly available lists created by the ego user. As such,
for many of the networks, the ground truth covered only a small fraction of the nodes
in the network, and for many networks the ground truth consisted of lists with very few
members. Since the target of this paper is non-overlapping and complete clustering, we
chose not to use the Twitter and Google+ networks due to the sparsity of their ground
truth. The Facebook networks have complete ground-truth labeling, so we used those
for evaluation. Although the Facebook networks are the highest quality of the three
datasets, they still contained ground-truth communities of 1-2 users. We pre-processed
each network by removing all ground-truth communities with fewer than three nodes.
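The pre-processing step amounts to a simple size filter on the ground-truth
communities. A minimal sketch follows, assuming the communities of one ego-network have
already been loaded as collections of node identifiers; the function name and the
min_size parameter are hypothetical.

```python
def drop_small_communities(communities, min_size=3):
    """Keep only ground-truth communities with at least min_size distinct nodes."""
    return [set(c) for c in communities if len(set(c)) >= min_size]


# Hypothetical ground truth for one ego-network.
ground_truth = [{1}, {2, 3}, {4, 5, 6}, {4, 7, 8, 9}]
print(drop_small_communities(ground_truth))
```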

As the ground truth for the ego-networks in all data sets can contain overlapping
communities, we cannot directly use the standard version of NMI for evaluation.
Lancichinetti et al. im propose an extension of NMI that can compare overlapping
communities, which we use to test EdgeBoost on the overlapping ground-truth data.
Figure 10 shows the results of using EdgeBoost with the same six community detection
algorithms used in the LFR experiments. The solid bars represent the performance of
the baseline community detection algorithm without EdgeBoost. The diagonal and
horizontal striped bars show the results from EdgeBoost paired with the Adamic-Adar
and Jaccard link predictors, respectively. We set the number of iterations for
EdgeBoost at 50. Each bar was generated by averaging the NMI score over 100 runs of the
baseline and of EdgeBoost paired with the Jaccard and Adamic-Adar link predictors.
EdgeBoost shows an improvement on most networks for each of the six community
detection algorithms; this result is consistent with our experiments on the LFR
benchmark. On the LFR benchmark networks, EdgeBoost paired with Jaccard link prediction
was consistently better than the other link prediction methods, but this is not
consistently the case on the Facebook networks. Jaccard outperforms Adamic-Adar most of
the time, but there are some cases when the opposite is true. While EdgeBoost shows
improvement for most combinations of algorithm and network, there are some instances
when the performance of EdgeBoost is lower than baseline. Overall, in 52 of the
60 total configurations EdgeBoost improves performance, by an average of 21%. In
the rare configurations (8 out of 60) when EdgeBoost performs worse than baseline,
EdgeBoost performs only 5% worse on average.

2 The Facebook dataset can be found at http://snap.stanford.edu/data/egonets-Facebook.html

Figure 10: Comparison of EdgeBoost on ego-networks from Facebook.
Figure 11: Varying the number of iterations (NumIterations) for EdgeBoost with
μ = 0.2 (left) and μ = 0.5 (right) over δ values ranging from 0.0 to 0.6 (curves for
δ = 0.0, 0.2, 0.4, and 0.6). (a) EdgeBoost with Louvain. (b) EdgeBoost with InfoMap.

5.2 Varying the Parameters of EdgeBoost 

In addition to comparing EdgeBoost using different community detection
algorithms, we also analyzed how the performance varies with respect to different
parameter settings. For these experiments, the curves were generated by averaging over
50 networks generated via the LFR benchmark. Figures 11a and 11b show the convergence
of the Louvain and InfoMap algorithms as a function of the “number of community
detection iterations” (NumIterations). Most of the performance gain from EdgeBoost
can be had with NumIterations set to 10, and setting the number of iterations beyond
50 yields little additional benefit. The convergence of EdgeBoost is qualitatively
similar for low and high values of μ and over the entire range of δ values.
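To make the role of NumIterations concrete, the sketch below accumulates a
co-community network from the partitions produced by repeated runs. It is an
assumption-laden illustration: the exact weight definition is given in Section
Partition Aggregation and is not reproduced here, so the example simply takes the
weight of a node pair to be the fraction of runs in which the two nodes fall in the
same community. The function name and input format are hypothetical.

```python
from collections import Counter
from itertools import combinations


def co_community_weights(partitions):
    """Fraction of runs in which each node pair was placed in the same community."""
    counts = Counter()
    for partition in partitions:
        for community in partition:
            for pair in combinations(community, 2):
                counts[frozenset(pair)] += 1
    runs = len(partitions)
    return {pair: c / runs for pair, c in counts.items()}


# Hypothetical output of two community detection runs on three nodes.
runs = [[{1, 2, 3}], [{1, 2}, {3}]]
print(co_community_weights(runs))
```

With more runs the weights take finer-grained values, which is one way to read the
diminishing returns observed beyond roughly 50 iterations.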

Figure 12: Varying the co-community threshold (τ) for EdgeBoost with μ = 0.2
(left) and μ = 0.5 (right) over δ values ranging from 0.0 to 0.6 (curves for δ = 0.0,
0.2, 0.4, and 0.6; x-axis: threshold, from 0.5 to 1.0). (a) EdgeBoost with Louvain.
(b) EdgeBoost with InfoMap.

In Section Partition Aggregation we propose a method for automatically selecting
the co-community threshold τ, which we have used for all of the previous experiments.
Since the selection of τ is the most computationally expensive part of the entire
EdgeBoost pipeline, we present an analysis of how EdgeBoost performs with a manual
selection of τ. Figures 12a and 12b show how EdgeBoost performs when varying the
selection of τ for EdgeBoost paired with Louvain and InfoMap, respectively. For
both algorithms, EdgeBoost can achieve good performance for values of τ in the
range 0.6-0.9, indicating that manual τ selection can be an effective way to save
computational resources and still boost performance over baseline. For higher values
of μ, the performance of EdgeBoost is more dependent on τ, especially for the Louvain
algorithm. Since the Louvain algorithm performs less reliably for higher μ values, the
co-community network has noisier edge weights, which makes the selection of τ
more critical to achieving good performance.
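With a manually chosen threshold, aggregation reduces to pruning the co-community
network and taking connected components. The sketch below illustrates this with a
small union-find structure. The function name and the frozenset-keyed weight
dictionary are hypothetical, and keeping edges whose weight is greater than or equal
to τ (rather than strictly greater) is an assumption of the example.

```python
def components_at_threshold(nodes, weight, tau):
    """Connected components of the co-community network after pruning edges below tau."""
    parent = {v: v for v in nodes}

    def find(v):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for pair, w in weight.items():
        if w >= tau:  # assumption: edges with weight exactly tau are kept
            a, b = tuple(pair)
            parent[find(a)] = find(b)

    groups = {}
    for v in nodes:
        groups.setdefault(find(v), set()).add(v)
    return list(groups.values())


# Hypothetical co-community weights on a path of four nodes.
weight = {frozenset((1, 2)): 0.9, frozenset((2, 3)): 0.6, frozenset((3, 4)): 0.3}
print(components_at_threshold([1, 2, 3, 4], weight, tau=0.8))
```

Raising τ can only split components, never merge them, so the partitions obtained at
increasing thresholds are nested.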
Figure 13: Analysis of execution time.

5.3 Runtime Analysis 

The most computationally expensive module of EdgeBoost is the aggregation
algorithm, which requires the computation of connected components at various
thresholds. The complexity of computing connected components is worst case O(|E|), where
|E| is the number of edges in the network. The aggregation module computes the
connected components on the co-community network, which can be much denser than
the input network. In theory it is possible for the co-community network to have O(n^2)
edges, which would make the aggregation module computationally expensive.
In order to show that EdgeBoost scales well to large networks, we
ran it on LFR networks of various sizes, ranging from 1000 to 128000 nodes. Figure 13
shows the run time of EdgeBoost with Louvain and Jaccard link prediction
with the number of iterations set to 10. While EdgeBoost does have a significant
time overhead over Louvain, it still scales in the same manner as Louvain.


6 Discussion and Future Work 

Our modified LFR benchmark used random edge deletion to model missing edges
in networks. While we chose this model because we think that it is the most generally
applicable, there are other possibilities for modeling missing edges. One area of
future research is to see how community detection algorithms are affected by
different edge deletion strategies. Further experiments are necessary to determine if our
techniques withstand biased edge removal, but we believe that repeated link prediction
will nonetheless boost performance. Further, if missing intra-community edges could
be modeled more accurately, development of better link prediction algorithms may be
possible.
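The random deletion model discussed above can be sketched in a few lines. This is an
illustrative sketch rather than the benchmark code: it assumes that a fixed fraction δ
of the edges is removed uniformly at random without replacement, and the function name,
the edge-list representation, and the seed parameter are hypothetical. A biased
deletion strategy would replace the uniform sampling step.

```python
import random


def delete_random_edges(edges, delta, seed=None):
    """Remove a fraction delta of the edges, chosen uniformly at random."""
    rng = random.Random(seed)
    edges = list(edges)
    n_remove = round(delta * len(edges))
    # Sample the positions of the edges to delete, without replacement.
    removed = set(rng.sample(range(len(edges)), n_remove))
    return [e for i, e in enumerate(edges) if i not in removed]


# Hypothetical edge list: a path on eleven nodes (ten edges).
edges = [(i, i + 1) for i in range(10)]
print(len(delete_random_edges(edges, delta=0.3, seed=0)))
```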

In order to increase the quality of community detection, EdgeBoost trades off
time and space efficiency. The construction of the co-community network can be
memory consuming because it is likely to be much denser than the input network. In
addition, EdgeBoost requires many runs of a sometimes costly community detection
algorithm. While EdgeBoost can scale to reasonably large networks (see Section
Runtime Analysis), we acknowledge these trade-offs and emphasize that EdgeBoost
is not designed for million-node networks. Instead, it was designed for use on small and
medium networks (e.g., ego-networks and citation networks), in which data sparsity
problems are common and communities reflect meaningful structures in the data.

While we have shown the efficacy of EdgeBoost in computing better partitions,
it is possible that the approach can also improve other types of community analysis.
Given different thresholds at which we can prune the co-community network (see
Section Partition Aggregation) and the corresponding sets of connected components, we
can obtain a set of communities with a specified confidence. Some applications may not
require a complete partitioning of nodes and may even be better served by an
incomplete partition that has higher quality communities. In future work we would
also like to see how EdgeBoost can be used in the detection of overlapping and/or
hierarchical communities. This extension would require a different aggregation
function, as our current method is only capable of creating strict partitions by
computing connected components.

The link predictors tested in this paper are all based on shared neighbors, and
therefore are only capable of inferring missing connections between nodes that are at
most 2 hops from each other. One issue with predicting links between nodes that are
further apart is the computational complexity, since most of the metrics that are not
neighborhood based rely on the number of shortest paths between pairs of nodes. While
not presented in this paper, we experimented with the local path index proposed by [26],
which predicts links between nodes that are as far as 3 hops from each other, but did
not see any noticeable improvement. Other link predictors that we did not explore are
those that utilize node attributes (e.g., school and city) and/or link structure to score
missing edges. Since most of the methods in disjoint community detection do not
account for node attributes, in the future EdgeBoost could be a robust way to integrate
node attributes into existing algorithms.


7 Conclusions 

Networks inferred or collected from real data are often susceptible to missing
edges. We have shown that as the percentage of missing edges in a network grows,
the quality of community detection decreases substantially. To counter this, we
proposed EdgeBoost as a framework to improve community detection on incomplete
networks. EdgeBoost is capable of improving all the community detection
algorithms we tested with its novel application of repetitive link prediction, including
on real ego-networks from Facebook. EdgeBoost is an easy-to-implement
meta-algorithm that can be used to improve any user-specified community detection
algorithm and we anticipate that it will be useful in many applications.